200826 Expt.296: Sequence-Agnostic RNP Purification

NB: This demonstration analysis was performed on unpublished development data. The original experiment was designed to investigate the conditions under which RBP-RNA complexes could be purified on silane. The data shown here is only one part of 4 datasets in the experiment (including the protocol which worked)

Aims:
A. To investigate different buffer compositions, temperature and elution conditions that may reveal a capacity for the purification of RBP-RNA complexes via selective precipitation.

Method:
See associated writeup for Expt.296. There are 4 MQ datasets for this experiment totalling n=16 tested conditions, in this tutorial we will work with one dataset encompassing n=4 conditions.
In this particular experiment the sample names refer to:

  • 20/30 = 20% or 30% ethanol in the capture and wash media
  • heat/obd = heated elution at 55deg vs on-bead digest. In this case the heated eluate was first taken, followed by the on-bead digestion ON THE SAME SAMPLE.
  • pH7 = These samples were processed under pH7.8
  • Silane = the bead chemistry used, in this case MyOne Silane Dynabeads

1. Import Modules and Files

Custom Functions
jwrangle.importMixedFiles( )

I generally import everything I MIGHT use at the start and set up pathing using the OS-agnostic pathlib.

In [1]:
#### Import necessary modules
import os
import pandas as pd
from pathlib import Path
import copy
import jwrangle
import jvis
import jinspect
import jtest
import jweb
from imp import reload
import upsetplot as upset
from Bio import SeqIO
import altair as alt

import numpy as np
import preprocess
import seaborn as sns

#### define working directories
cwd  = Path(os.getcwd())
base_path = Path(os.path.join(*cwd.parts[:cwd.parts.index('experiments')]))

#### MaxQuant proteinGroups & evidence files
MQ_folder = jwrangle.importMixedFiles(cwd / 'MaxQuant')
MQ_folder.keys()

pGroups = MQ_folder['proteinGroups.txt']
evidence = MQ_folder['evidence.txt']

#### Inspect MQ setup
MQ_folder['parameters.txt'].head(9)
C:\Users\smith.j\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3242: DtypeWarning: Columns (53,54,61) have mixed types.Specify dtype option on import or set low_memory=False.
  if (await self.run_code(code, result,  async_=asy)):
C:\Users\smith.j\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3242: DtypeWarning: Columns (321,322) have mixed types.Specify dtype option on import or set low_memory=False.
  if (await self.run_code(code, result,  async_=asy)):
Out[1]:
Parameter Value
0 Version 1.6.7.0
1 User name smith.j
2 Machine name MSCYPHER-253
3 Date of writing 08/26/2020 20:00:00
4 Include contaminants True
5 PSM FDR 0.01
6 PSM FDR Crosslink 0.01
7 Protein FDR 0.01
8 Site FDR 0.01

2. Metadata Creation

Custom Functions
jwrangle.MQ_writeMetadata( )

Metadata tabulates the test conditions for ALL experiments that shared the same MQ search and thuss all experiments that comprise the MQ outputs. Metadata can also be done in a spreadsheet program. Here, I have created my metadata programmatically instead (I simply find this easier).
The metadata table gives users the opportunity to rename samples and define the experimental parameters for the data. This task can be expecially complex for MaxQuant because a unified output is generated even if distinctly separate experiments are searched as a batch and with different parameters applied.
The function jwrangle.MQ_writeMetadata( ) will take a metadata table, rename all samples in the proteinGroups and evidence files, assign alternative filenames, and save new copies to be used in future analyses.

In [2]:
#### Inspect column names
colnames = list(pGroups.columns.values)

#### Derive experiment names as a list
#### I gave the original samples some irritating names that I'd like to correct
experiment_names = []
for i in colnames:
    if 'Intensity ' in i:
        experiment_names.append(i.replace('Intensity ', ''))

new_expt_names = []
for i in experiment_names:
    i = i.replace('pH8S', 'pH7-S')
    i = i.replace('ne', 'ne-')
    i = i.replace('NC', 'nCL')
    i = i.replace('PC', '254')
    i = i.replace('heat', '-heat')
    i = i.replace('obd', '-obd')
    new_expt_names.append(i)
    
#### Create a list of associated conditions
#### These samples were mislabeled 'NC' and 'PC' (the latter is often short for PAR-CL, but these samples are 254nm crosslinked)
condition =  []
for i in new_expt_names:
    trim = i[3:-2]
    condition.append(trim)

#### Create a list of associated replicate identifiers
replicate = []
for i in new_expt_names:
    replicate.append(i.split('-')[-1])

#### Create a more reader friendly list of sample names
samples =  []
for i in new_expt_names:  
    samples.append(i[3:])
    
#### The MQ_groups dictionary assigns each condition to its MQ group parameter
MQ_groups = {'20silaneheat':['pH7-Silane-20-heat_nCL','pH7-Silane-20-heat_254'], 
             '30silaneheat':['pH7-Silane-30-heat_nCL','pH7-Silane-30-heat_254'],
             '20silaneobd':['pH7-Silane-20-obd_nCL','pH7-Silane-20-obd_254'],
             '30silaneobd':['pH7-Silane-30-obd_nCL','pH7-Silane-30-obd_254']}

#### We then use the condition list to create an MQ_groups list
Grp_parameters = []
for label in condition:
    for key, value in MQ_groups.items():
        if label in value:
            Grp_parameters.append(key)

exp_df = pd.DataFrame(
    {'experiment': experiment_names,
     'condition': condition,
     'replicate': replicate,
     'sample':samples,
     'measure':['Intensity']*len(samples),                  # adding this column allows our metadata file to be compatible with Proteus
     'MQgroups':Grp_parameters
    })

exp_df
Out[2]:
experiment condition replicate sample measure MQgroups
0 17_pH8Silane20heat_NC-A pH7-Silane-20-heat_nCL A pH7-Silane-20-heat_nCL-A Intensity 20silaneheat
1 18_pH8Silane20heat_NC-B pH7-Silane-20-heat_nCL B pH7-Silane-20-heat_nCL-B Intensity 20silaneheat
2 19_pH8Silane20heat_NC-C pH7-Silane-20-heat_nCL C pH7-Silane-20-heat_nCL-C Intensity 20silaneheat
3 20_pH8Silane20heat_NC-D pH7-Silane-20-heat_nCL D pH7-Silane-20-heat_nCL-D Intensity 20silaneheat
4 21_pH8Silane20heat_PC-A pH7-Silane-20-heat_254 A pH7-Silane-20-heat_254-A Intensity 20silaneheat
5 22_pH8Silane20heat_PC-B pH7-Silane-20-heat_254 B pH7-Silane-20-heat_254-B Intensity 20silaneheat
6 23_pH8Silane20heat_PC-C pH7-Silane-20-heat_254 C pH7-Silane-20-heat_254-C Intensity 20silaneheat
7 24_pH8Silane20heat_PC-D pH7-Silane-20-heat_254 D pH7-Silane-20-heat_254-D Intensity 20silaneheat
8 25_pH8Silane30heat_NC-A pH7-Silane-30-heat_nCL A pH7-Silane-30-heat_nCL-A Intensity 30silaneheat
9 26_pH8Silane30heat_NC-B pH7-Silane-30-heat_nCL B pH7-Silane-30-heat_nCL-B Intensity 30silaneheat
10 27_pH8Silane30heat_NC-C pH7-Silane-30-heat_nCL C pH7-Silane-30-heat_nCL-C Intensity 30silaneheat
11 28_pH8Silane30heat_NC-D pH7-Silane-30-heat_nCL D pH7-Silane-30-heat_nCL-D Intensity 30silaneheat
12 29_pH8Silane30heat_PC-A pH7-Silane-30-heat_254 A pH7-Silane-30-heat_254-A Intensity 30silaneheat
13 30_pH8Silane30heat_PC-B pH7-Silane-30-heat_254 B pH7-Silane-30-heat_254-B Intensity 30silaneheat
14 31_pH8Silane30heat_PC-C pH7-Silane-30-heat_254 C pH7-Silane-30-heat_254-C Intensity 30silaneheat
15 32_pH8Silane30heat_PC-D pH7-Silane-30-heat_254 D pH7-Silane-30-heat_254-D Intensity 30silaneheat
16 33_pH8Silane20obd_NC-A pH7-Silane-20-obd_nCL A pH7-Silane-20-obd_nCL-A Intensity 20silaneobd
17 34_pH8Silane20obd_NC-B pH7-Silane-20-obd_nCL B pH7-Silane-20-obd_nCL-B Intensity 20silaneobd
18 35_pH8Silane20obd_NC-C pH7-Silane-20-obd_nCL C pH7-Silane-20-obd_nCL-C Intensity 20silaneobd
19 36_pH8Silane20obd_NC-D pH7-Silane-20-obd_nCL D pH7-Silane-20-obd_nCL-D Intensity 20silaneobd
20 37_pH8Silane20obd_PC-A pH7-Silane-20-obd_254 A pH7-Silane-20-obd_254-A Intensity 20silaneobd
21 38_pH8Silane20obd_PC-B pH7-Silane-20-obd_254 B pH7-Silane-20-obd_254-B Intensity 20silaneobd
22 39_pH8Silane20obd_PC-C pH7-Silane-20-obd_254 C pH7-Silane-20-obd_254-C Intensity 20silaneobd
23 40_pH8Silane20obd_PC-D pH7-Silane-20-obd_254 D pH7-Silane-20-obd_254-D Intensity 20silaneobd
24 41_pH8Silane30obd_NC-A pH7-Silane-30-obd_nCL A pH7-Silane-30-obd_nCL-A Intensity 30silaneobd
25 42_pH8Silane30obd_NC-B pH7-Silane-30-obd_nCL B pH7-Silane-30-obd_nCL-B Intensity 30silaneobd
26 43_pH8Silane30obd_NC-C pH7-Silane-30-obd_nCL C pH7-Silane-30-obd_nCL-C Intensity 30silaneobd
27 44_pH8Silane30obd_NC-D pH7-Silane-30-obd_nCL D pH7-Silane-30-obd_nCL-D Intensity 30silaneobd
28 46_pH8Silane30obd_PC-B pH7-Silane-30-obd_254 B pH7-Silane-30-obd_254-B Intensity 30silaneobd
29 46_pH8Silane30obd_PC-F pH7-Silane-30-obd_254 F pH7-Silane-30-obd_254-F Intensity 30silaneobd
30 47_pH8Silane30obd_PC-C pH7-Silane-30-obd_254 C pH7-Silane-30-obd_254-C Intensity 30silaneobd
31 48_pH8Silane30obd_PC-D pH7-Silane-30-obd_254 D pH7-Silane-30-obd_254-D Intensity 30silaneobd
In [ ]:
MQ_expt296 = jwrangle.MQ_writeMetadata(pGroups, evidence, exp_df, 'e296', cwd)

3. Re-Annotate the MaxQuant pGroups with stable Gene IDs

Functions
jweb.mapAnyID( )
jwrangle.importMixedFiles( )

MaxQuant does a good job of assigning a Gene name to each protein group. Presumably these gene names come from the FASTA. However:

  • Sometimes it fails to find a gene name
  • Sometimes it will assign an ID that is not a gene or include out-of-place characters
  • It doesn't always seem to be consistent
  • If the gene name originates from the FASTA then repeating the MQ search with an updated FASTA is the only way to update the gene IDs.
  • Use of a mapping service will standardise the ID conversion practices between my datasets and those of others, including RNA-Seq.

To avoid these problems we will remap the Majority protein IDs to ENTREZ gene IDs. jweb.mapAnyID( ) will retrieve all possible genes for each protein group, and will also select a primary ID to singularly represent the group by a consistent method. This is a very flexible function, see help( ) for further explanation. From this point, the MQ 'Gene names' column will no longer be necessary. This function can also handle ID mapping to and from almost any convention.

Ensuring our proteins have a consistent gene naming strategy is essential for inter-experiment comparison and the later use of set methods. It also creates a standard that can be applied for accurately mapping RNA-Seq results and thus aid in future mapping of protein-RNA partners.

In [2]:
#### If not already loaded, read in the metadata-adjusted files
metadata = pd.read_csv(cwd / 'metadata' / 'e296_metadata.csv', index_col = 0)
pGroups = pd.read_csv(cwd / 'MaxQuant' / 'e296_proteinGroups_metalabeled.txt', delimiter = '\t')
evidence = pd.read_csv(cwd / 'MaxQuant' / 'e296_evidence_metalabeled.txt', delimiter = '\t')
C:\Users\smith.j\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3051: DtypeWarning: Columns (321,322) have mixed types.Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
C:\Users\smith.j\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3051: DtypeWarning: Columns (53,54,61) have mixed types.Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
In [3]:
#### Dynamically remap gene names in our proteinGroups file and save a copy
pGroups_remap = jweb.mapAnyID_gPro(pGroups['Majority protein IDs'].tolist(), splitstr = [';', '-'], geneProductType = 'protein', 
                              gConvertOrganism = 'hsapiens', gConvertTarget = 'ENTREZGENE', writetopath = [cwd, 'pGroups_remap'], writeTargetsAsList = 'NO')
In [4]:
#### If not already loaded, read in the remapped proteinGroups file
pGroups_remap = jwrangle.importMixedFiles(cwd / 'downloads' / 'pGroups_remap', dropSuffix = 'yes')
pGroups_remap.keys()
Out[4]:
dict_keys(['id_map', 'query_map'])
In [5]:
#### jwrangle.importMixedFiles() returns a dictionary where keys = files. We want the 'id_map' table created by jweb.mapAnyID_gPro().
#### We'll rename the Query column and drop duplicates so the table can be merged with our proteinGroups table.
id_map = pGroups_remap['id_map'].rename(columns={'Query': 'Majority protein IDs'}).drop_duplicates()

#### Now use merge to add these new columns to our proteinGroups table
pGroups_map = pd.merge(pGroups, id_map, on='Majority protein IDs', how='left')

#### Check the tables are merged by viewing column elements from each.
pGroups_map[id_map.columns.tolist() + ['Peptide IDs']].head(2)
Out[5]:
Majority protein IDs ENTREZGENE_gPro all ENTREZGENE_gPro primary ENTREZGENE_gPro name UNIPROT_gPro status Peptide IDs
0 A0A024R4E5;Q00341;Q00341-2;H0Y394 HDLBP;None HDLBP high density lipoprotein binding protein [Sour... SWISSPROT 57;711;1649;1671;1940;2102;2775;2824;3106;3440...
1 A0A024R4M0;P46781;B5MCT8;C9JM19 RPS9 RPS9 ribosomal protein S9 [Source:HGNC Symbol;Acc:H... SWISSPROT 6085;6086;10517;11131;11450;13657;14430;14702;...

4. Review Contaminants by Sample

Functions
jinspect.MQ_getContaminants( )
MQ_getContaminants_sbplot( )
jwrangle.importMixedFiles( )

We can extract the conaminants from our proteinGroups file using jinspect.MQ_getContaminants( ). These extracted table will return log2(iBAQ values).
Contaminants can then be reviewed with _MQ_getContaminantssbplot( ).

In [6]:
#### Extract contaminants
contaminants = jinspect.MQ_getContaminants(pGroups_map, metadata)

#### Sort the metadata into a more intuitive order
metadata_sort = metadata.sort_values(by = ['MQgroups','condition', 'replicate'], ascending=[True, False, True])

#### Visually inspect contaminants  
jvis.MQ_getContaminants_sbplot(contaminants, metadata_sort, width = 3, length = 2, layout = 'single')
<Figure size 432x288 with 0 Axes>

Results: GOOD

Very few contaminants in the samples that were heat eluted. Many contaminants in the samples that were subjected to on-bead digest. This is to be expected if selective precipition, based on the presence of a protein-RNA complex, is occurring.
This is because the on-bead digest samples will contain mostly pure precipitated protein.

5. Assess Digestion Efficiency

Functions
jinspect.MQ_getMissedCleavages( )
jvis.CommonPalettesAsHex
jvis.BarPlotByGroup_sbplot( )

Assessing missed cleavages is an essential metric for understanding the quality of the tryptic digestion. This data is recorded in the evidence file.
_jinspect.MQgetMissedCleavages( ) will return a long form data table that can easily be used for plotting.
The dictionary jvis.CommonPalettesAsHex contains a number of palettes that are common to both matplotlib and ggplot (from R). These are provided to ensure consistency is easy to achieve across both languages.
We'll plot the missed cleavages with the generic function jvis.BarPlotByGroup_sbplot( )

In [7]:
#### Extract the missed cleavage data into a long form table for plotting
MissedCleavages = jinspect.MQ_getMissedCleavages(evidence, metadata, drop_contaminants = True)
MissedCleavages.sort_values(by=['expt','sample'], inplace = True)
MissedCleavages
Out[7]:
sample % Missed Cleavages group expt
2 pH7-Silane-20-heat_254-A 24 pH7-Silane-20-heat_254 20silaneheat
12 pH7-Silane-20-heat_254-B 24 pH7-Silane-20-heat_254 20silaneheat
25 pH7-Silane-20-heat_254-C 24 pH7-Silane-20-heat_254 20silaneheat
29 pH7-Silane-20-heat_254-D 24 pH7-Silane-20-heat_254 20silaneheat
30 pH7-Silane-20-heat_nCL-A 22 pH7-Silane-20-heat_nCL 20silaneheat
26 pH7-Silane-20-heat_nCL-B 21 pH7-Silane-20-heat_nCL 20silaneheat
19 pH7-Silane-20-heat_nCL-C 16 pH7-Silane-20-heat_nCL 20silaneheat
3 pH7-Silane-20-heat_nCL-D 17 pH7-Silane-20-heat_nCL 20silaneheat
15 pH7-Silane-20-obd_254-A 31 pH7-Silane-20-obd_254 20silaneobd
13 pH7-Silane-20-obd_254-B 32 pH7-Silane-20-obd_254 20silaneobd
28 pH7-Silane-20-obd_254-C 32 pH7-Silane-20-obd_254 20silaneobd
11 pH7-Silane-20-obd_254-D 32 pH7-Silane-20-obd_254 20silaneobd
17 pH7-Silane-20-obd_nCL-A 35 pH7-Silane-20-obd_nCL 20silaneobd
27 pH7-Silane-20-obd_nCL-B 34 pH7-Silane-20-obd_nCL 20silaneobd
21 pH7-Silane-20-obd_nCL-C 34 pH7-Silane-20-obd_nCL 20silaneobd
6 pH7-Silane-20-obd_nCL-D 34 pH7-Silane-20-obd_nCL 20silaneobd
24 pH7-Silane-30-heat_254-A 25 pH7-Silane-30-heat_254 30silaneheat
20 pH7-Silane-30-heat_254-B 25 pH7-Silane-30-heat_254 30silaneheat
8 pH7-Silane-30-heat_254-C 25 pH7-Silane-30-heat_254 30silaneheat
16 pH7-Silane-30-heat_254-D 26 pH7-Silane-30-heat_254 30silaneheat
23 pH7-Silane-30-heat_nCL-A 18 pH7-Silane-30-heat_nCL 30silaneheat
18 pH7-Silane-30-heat_nCL-B 18 pH7-Silane-30-heat_nCL 30silaneheat
9 pH7-Silane-30-heat_nCL-C 20 pH7-Silane-30-heat_nCL 30silaneheat
14 pH7-Silane-30-heat_nCL-D 19 pH7-Silane-30-heat_nCL 30silaneheat
5 pH7-Silane-30-obd_254-B 22 pH7-Silane-30-obd_254 30silaneobd
22 pH7-Silane-30-obd_254-C 33 pH7-Silane-30-obd_254 30silaneobd
0 pH7-Silane-30-obd_254-D 33 pH7-Silane-30-obd_254 30silaneobd
7 pH7-Silane-30-obd_254-F 32 pH7-Silane-30-obd_254 30silaneobd
4 pH7-Silane-30-obd_nCL-A 35 pH7-Silane-30-obd_nCL 30silaneobd
31 pH7-Silane-30-obd_nCL-B 35 pH7-Silane-30-obd_nCL 30silaneobd
1 pH7-Silane-30-obd_nCL-C 36 pH7-Silane-30-obd_nCL 30silaneobd
10 pH7-Silane-30-obd_nCL-D 35 pH7-Silane-30-obd_nCL 30silaneobd
In [8]:
### Select a colour palette  
cpal = jvis.CommonPalettesAsHex

set2_paired = []
for i in cpal['Set2_qual']:
    set2_paired.append(i)
    set2_paired.append(i)

#### Plot the grouped data points  
sns.set_style('whitegrid')
jvis.BarPlotByGroup_sbplot(MissedCleavages, x_col = 'group', y_col = '% Missed Cleavages', title = '% Missed Cleavages', pal = set2_paired)
<Figure size 432x288 with 0 Axes>

Results: Average

Digestion efficiency isn't great. Let's reduce the digestion volume by 0.5x and maintain the same trypsin concentration in future experiments

6. Remove Contaminants

Functions
jwrangle.MQ_getThreePassFilter( )
SeqIO.parse( )

After QC we no longer want the contaminants in our data. jwrangle.MQ_getThreePassFilter( ) will remove reverse peptides, contaminants, and only identified by site from MQ tables.
The filter will also accept customised exclusion lists in case users have added odd protein species to the search FASTA tables. In this particular experiment we added to the human FASTA, RNAse proteins and the large T antigen.
The former as 1) a check that dynamic range is not being overwhelmed and 2) as an quantitative spike-in control to compare tryptic efficiency and the sample recovery across samples following C18 cleanup.

In [9]:
#### Map the location of the custom FASTA elements
os.listdir(base_path / 'my_resources' / 'FASTA')  

#### Create a list of the non-human proteins that were added to the custom FASTA genome search. 
new_cont = []
with open(base_path / 'my_resources' / 'FASTA' / "custom_proteome_elements.fasta", "r") as handle:
    for record in SeqIO.parse(handle, "fasta"):
        new_cont.append(record.id.split('|')[1])

#### Remove all unwanted contaminants and IDs from the proteinGroups table      
pGroup_clean = jwrangle.MQ_getThreePassFilter(pGroups_map, custom_exclusion = new_cont)

#### Inspect the cleaned dataframe
pGroup_clean[['ENTREZGENE_gPro primary'] + [i for i in pGroup_clean.columns if 'iBAQ' in i]].head(2)
Out[9]:
ENTREZGENE_gPro primary iBAQ iBAQ pH7-Silane-20-heat_nCL-A iBAQ pH7-Silane-20-heat_nCL-B iBAQ pH7-Silane-20-heat_nCL-C iBAQ pH7-Silane-20-heat_nCL-D iBAQ pH7-Silane-20-heat_254-A iBAQ pH7-Silane-20-heat_254-B iBAQ pH7-Silane-20-heat_254-C iBAQ pH7-Silane-20-heat_254-D ... iBAQ pH7-Silane-20-obd_254-C iBAQ pH7-Silane-20-obd_254-D iBAQ pH7-Silane-30-obd_nCL-A iBAQ pH7-Silane-30-obd_nCL-B iBAQ pH7-Silane-30-obd_nCL-C iBAQ pH7-Silane-30-obd_nCL-D iBAQ pH7-Silane-30-obd_254-B iBAQ pH7-Silane-30-obd_254-F iBAQ pH7-Silane-30-obd_254-C iBAQ pH7-Silane-30-obd_254-D
0 HDLBP 1.044400e+09 0.0 0.0 0.0 0.0 9.678600e+07 7.080900e+07 9.100200e+07 9.025000e+07 ... 60376000.0 50760000.0 0.0 0.0 0.0 0.0 3174900.0 57683000.0 82764000.0 75414000.0
1 RPS9 1.689600e+10 0.0 0.0 0.0 0.0 1.421100e+09 1.117600e+09 1.870200e+09 2.090800e+09 ... 399610000.0 324730000.0 993560.0 1235400.0 0.0 0.0 36901000.0 420100000.0 579340000.0 495440000.0

2 rows × 34 columns

7. Drop Gene Duplicates and Filter Intensities by LFQ

Functions
jinspect.MQ_dropDuplicateIDs( )

The next step focuses on improving confidence in the quality of our data. This is done by applying jinspect.MQ_dropDuplicateIDs( ) which has the below effects:

  • Because one gene can have many proteins, sometimes Maxquant will create multiple proteinGroups for a single gene. As most of our analysis focuses on genes we'll trim the lowest quality proteinGroups duplicates from the table.
  • Standard LFQ defaults require a minimum of 2 peptide species, at least one of which must be unique, for quantitation to be applied. Intensity and iBAQ values, however, do not have such a minimum limit. I consider a 2 peptide minimum to be a wise filter but still have use for the Intensity and iBAQ values. Thus where the LFQ filter is applied all measurements that do not meet the minimum limit will be discarded. In short, if there isn't a companion LFQ value, there won't be an Intensity or iBAQ value either after filtering.
  • It has been documented that Match Between Runs suffers a high frequency of false peptide transfers (Lim, Paulo, Gygi 2019; doi: 10.1021/acs.jproteome.9b00492). At the protein level, however, this false transfer rate is greatly mitigated by the minimum peptide rule applied by the LFQ algorithm. This is another good reason for our filtering step.
In [10]:
#### Drop duplicates and apply LFQ filter
filter_dict = jinspect.MQ_dropDuplicateIDs(pGroup_clean, metadata, prefix = 'Peptides', ID = 'ENTREZGENE_gPro primary', pool = 'measure', drop_ID = 'None', 
                                            keep_PoolCalcs = False, applyLFQ_filter = ['Intensity', 'iBAQ'])

#### The df_keep value contains our targets, df_droprows conatins the discarded duplicates. Assign the df_keep value to a new variable and inspect.
pGroup_filtered = filter_dict['df_keep']
pGroup_filtered.head(2)
WARNING: jinspect.MQ_getMeasuredMeansByGroup() has not been checked
Out[10]:
Protein IDs Majority protein IDs Peptide counts (all) Peptide counts (razor+unique) Peptide counts (unique) Protein names Gene names Fasta headers Number of proteins Peptides ... iBAQ pH7-Silane-30-heat_nCL-C iBAQ pH7-Silane-30-heat_nCL-D iBAQ pH7-Silane-30-obd_254-B iBAQ pH7-Silane-30-obd_254-C iBAQ pH7-Silane-30-obd_254-D iBAQ pH7-Silane-30-obd_254-F iBAQ pH7-Silane-30-obd_nCL-A iBAQ pH7-Silane-30-obd_nCL-B iBAQ pH7-Silane-30-obd_nCL-C iBAQ pH7-Silane-30-obd_nCL-D
0 A0A024R4E5;Q00341;Q00341-2;H0Y394;H7C0A4;C9JIZ... A0A024R4E5;Q00341;Q00341-2;H0Y394 56;55;50;39;18;15;13;11;10;10;10;10;9;6;6;5;5;... 56;55;50;39;18;15;13;11;10;10;10;10;9;6;6;5;5;... 56;55;50;39;18;15;13;11;10;10;10;10;9;6;6;5;5;... Vigilin HDLBP ;;; 23 56 ... 0.0 0.0 3174900.0 82764000.0 75414000.0 57683000.0 0.0 0.0 0.0 0.0
1 A0A024R4M0;P46781;B5MCT8;C9JM19;F2Z3C0;A8MXK4 A0A024R4M0;P46781;B5MCT8;C9JM19 16;16;13;13;4;4 16;16;13;13;4;4 16;16;13;13;4;4 40S ribosomal protein S9 RPS9 ;;; 6 16 ... 0.0 0.0 36901000.0 579340000.0 495440000.0 420100000.0 0.0 1235400.0 0.0 0.0

2 rows × 336 columns

8. Review Sample Clustering by Group

Functions
jtest.getDistanceMatrix( )
jvis.MQ_showDendrogramQC_mplplot( )

A distance matrix function jtest.getDistanceMatrix( ) is provided for users who wish to apply different algorithms or create different visualisations.
I like the 'ward' method for distance calculations and using a dengrogram to confirm that clustering matches expectations and so use a prerolled function jvis.MQ_showDendrogramQC_mplplot( )

In [11]:
#### Confirm that clustering matches expectations
jvis.MQ_showDendrogramQC_mplplot(pGroup_filtered, 'Intensity', metadata, 'QC clustering: ', grid = 'YES', fsize = (8, 8))

Results: INTERESTING

Clustering follows the expected pattern for all heat eluted samples. For the on-bead digestion, the separation is ok though not as convincing; there are some 254 samples straying into the nCL groups under both 20% and 30% ethanol capture conditions.
Remember that on-bead digestion was performed on the same sample following heat elution. If selective elution (not selective precipitation) is occurring then this process will produce some difference; because that difference is not so
defined as for the heated eluate, however, we can expect the on bead digestion approach to be less discriminatory.

9. Analyse Normalisation Effects by Sample

Functions
jwrangle.Log2_ByPrefix( )
jwrangle.MQ_poolMulti( )
jvis.ViolinCompare_sbplot( )

Here we review normalisation effects on each sample within the condition groups; these are most easily interpreted after log2 transformation. We will transform all measures of interest with _jwrangle.Log2ByPrefix( ) and then pool all the values of interest, by condition, with _jwrangle.MQpoolMulti( ). The function _jvis.ViolinComparesbplot( ) will let use compare Intensity distribution on a per sample basis.

Normalisation is applied to LFQ values by MaxQuant and is a feature of its handling of label-free data. I've not seen a detailed explanation of how it works though so it is a leap of faith that Cox and Mann have selected an appropriate method.
Normalisation must be applied separately to nCL and cCL groups. This is unusual though necessary to avoid outrageous results caused by having groups with extreme differences. See expt.313 for evidence.

In [12]:
#### Log2 transform available intensity values.
pGroup_log2 = jwrangle.Log2_ByPrefix(pGroup_filtered, 'LFQ intensity')
pGroup_log2 = jwrangle.Log2_ByPrefix(pGroup_log2, 'iBAQ')
pGroup_log2 = jwrangle.Log2_ByPrefix(pGroup_log2, 'Intensity')
pGroup_log2.replace(0,np.nan, inplace=True)

#### Create a long form dataset for each desired grouping
pool_SampleIntensity = jwrangle.MQ_poolMulti(pGroup_log2, metadata, melt_list = ['Intensity', 'LFQ intensity'], group = 'condition')
pool_SampleIntensity.keys()
Out[12]:
dict_keys(['pH7-Silane-20-heat_nCL', 'pH7-Silane-20-heat_254', 'pH7-Silane-30-heat_nCL', 'pH7-Silane-30-heat_254', 'pH7-Silane-20-obd_nCL', 'pH7-Silane-20-obd_254', 'pH7-Silane-30-obd_nCL', 'pH7-Silane-30-obd_254'])
In [13]:
#### Inspect the Intensity shifts generated by the LFQ normalisation algorithm. In MaxQuant Raw Intensity and iBAQ values are 
#### not subjected to the LFQ normalisation calculations so this is a good way to spot any gross violations
sns.set_style('whitegrid')
jvis.ViolinCompare_sbplot(pool_SampleIntensity['pH7-Silane-20-heat_nCL'], title = 'pH7-Silane-20-heat_nCL: Normalisation Effects', ylabel = 'Log2(Intensity)', palette = ['#ff6666', '#99ccff'])
In [14]:
#### Inspect the Intensity shifts generated by the LFQ normalisation algorithm. In MaxQuant Raw Intensity and iBAQ values are 
#### not subjected to the LFQ normalisation calculations so this is a good way to spot any gross violations
sns.set_style('whitegrid')
jvis.ViolinCompare_sbplot(pool_SampleIntensity['pH7-Silane-20-heat_254'], title = 'pH7-Silane-20-heat_254: Normalisation Effects', ylabel = 'Log2(Intensity)', palette = ['#ff6666', '#99ccff'])
In [15]:
#### Inspect the Intensity shifts generated by the LFQ normalisation algorithm. In MaxQuant Raw Intensity and iBAQ values are 
#### not subjected to the LFQ normalisation calculations so this is a good way to spot any gross violations
sns.set_style('whitegrid')
jvis.ViolinCompare_sbplot(pool_SampleIntensity['pH7-Silane-30-heat_nCL'], title = 'pH7-Silane-30-heat_nCL: Normalisation Effects', ylabel = 'Log2(Intensity)', palette = cpal['Set3_qual'])
In [16]:
#### Inspect the Intensity shifts generated by the LFQ normalisation algorithm. In MaxQuant Raw Intensity and iBAQ values are 
#### not subjected to the LFQ normalisation calculations so this is a good way to spot any gross violations
sns.set_style('whitegrid')
jvis.ViolinCompare_sbplot(pool_SampleIntensity['pH7-Silane-30-heat_254'], title = 'pH7-Silane-30-heat_254: Normalisation Effects', ylabel = 'Log2(Intensity)', palette = cpal['Set3_qual'])
In [17]:
#### Inspect the Intensity shifts generated by the LFQ normalisation algorithm. In MaxQuant Raw Intensity and iBAQ values are 
#### not subjected to the LFQ normalisation calculations so this is a good way to spot any gross violations
sns.set_style('whitegrid')
jvis.ViolinCompare_sbplot(pool_SampleIntensity['pH7-Silane-20-obd_nCL'], title = 'pH7-Silane-20-obd_nCL: Normalisation Effects', ylabel = 'Log2(Intensity)', palette = cpal['Set3_qual'][4:])
In [18]:
#### Inspect the Intensity shifts generated by the LFQ normalisation algorithm. In MaxQuant Raw Intensity and iBAQ values are 
#### not subjected to the LFQ normalisation calculations so this is a good way to spot any gross violations
sns.set_style('whitegrid')
jvis.ViolinCompare_sbplot(pool_SampleIntensity['pH7-Silane-20-obd_254'], title = 'pH7-Silane-20-obd_254: Normalisation Effects', ylabel = 'Log2(Intensity)', palette = cpal['Set3_qual'][4:])
In [19]:
#### Inspect the Intensity shifts generated by the LFQ normalisation algorithm. In MaxQuant Raw Intensity and iBAQ values are 
#### not subjected to the LFQ normalisation calculations so this is a good way to spot any gross violations
sns.set_style('whitegrid')
jvis.ViolinCompare_sbplot(pool_SampleIntensity['pH7-Silane-30-obd_nCL'], title = 'pH7-Silane-30-obd_nCL: Normalisation Effects', ylabel = 'Log2(Intensity)', palette = cpal['Set1_qual'][2:])
In [20]:
#### Inspect the Intensity shifts generated by the LFQ normalisation algorithm. In MaxQuant Raw Intensity and iBAQ values are 
#### not subjected to the LFQ normalisation calculations so this is a good way to spot any gross violations
sns.set_style('whitegrid')
jvis.ViolinCompare_sbplot(pool_SampleIntensity['pH7-Silane-30-obd_254'], title = 'pH7-Silane-30-obd_254: Normalisation Effects', ylabel = 'Log2(Intensity)', palette = cpal['Set1_qual'][2:])

Result: GOOD

Normalisation of heat + nCL samples is a bit wobbly (yes thats the proper word). This is to be expected, as we can later confirm that these samples have very few peptides present.
Normalisation of the 30% EtOH > Silane > On bead digest is quite dramatic; likely the result of technical variability in the precipitation efficiency onto the silane beads. This variability, however, is not present in the 20% version.

10. Compare Intensity Distribution and Sequence Coverage

Functions
jwrangle.MQ_poolDataByCondition( )
jvis.BoxPlotByColumn_sbplot( )

Next we will compare intensity and sequence coverage between groups. Log2 transformation has already been performed so we need only use jwrangle.MQ_poolDataByCondition( ) to create the appropriate long form dataset for plotting with jvis.BoxPlotByColumn_sbplot( ).

In [21]:
#### Pool data into a single long form dataset
pooled_dfDropGroupOne = jwrangle.MQ_poolDataByCondition(pGroup_log2, metadata_sort, prefix_list = ['Intensity', 'Sequence coverage'])

#### Compare Intensity distribution using a box and whisker plot
sns.set_style('whitegrid')
jvis.BoxPlotByColumn_sbplot(pooled_dfDropGroupOne, 'Intensity: ', 'Intensity')
In [22]:
#### Compare Sequence coverage using a box and whisker plot
sns.set_style('whitegrid')
jvis.BoxPlotByColumn_sbplot(pooled_dfDropGroupOne, 'Sequence coverage: ', 'Sequence coverage %')

Result: GOOD

The greatest distinction between nCL and cCL conditions is given by the heated elution sammples.

11. Compare Sum Peptide Counts

Functions
jinspect.MQ_getSumBySample( )
jvis.BarPlotByGroup_sbplot( )

To sum the total peptides observed across all proteins use _jinspect.MQgetSumBySample( ). These sums will be returned as a modified metadata table.
Plotting these by group is easily done with jvis.BarPlotByGroup_sbplot( ). The plotting order is determined by the metadata ordering.
In this case we are inspecting the number of peptides detected after having removed contaminants- thus if some spike-in proteins were removed, i.e. in this case RNAse treatments, they will not contribute to the peptide count. To look at the replicability of these spike-ins, we would reach back to the 'df_droprows' table generated by jinspect.MQ_dropDuplicateIDs( ) in section 7.

In [23]:
#### Extract the total peptides observed per sample
metaStats = jinspect.MQ_getSumBySample(pGroup_log2, metadata_sort, freqList = ['Peptides'], measure = False)

#### Plot the sum peptides
sns.set_style('whitegrid')
jvis.BarPlotByGroup_sbplot(metaStats, x_col = 'condition', y_col = 'Peptides', title = 'Sum Peptides vs pH enrichment', pal = set2_paired,
                          errorbars = 'SEM')
<Figure size 432x288 with 0 Axes>

Results: Interesting

Consistent with suspicions from the sequence coverage and normalisation data, there are very few peptides in the nCL heated samples. These samples are juxtaposed by their cCL counterparts which have high yields similar to those of the on-bead digest condition.

12. Compare Unique Gene Counts

Functions
jinspect.MQ_getFrequencyBySample( )

One gene can encode for many proteins that often share regions of similarity. As for illumina-based RNA-Seq, however, shotgun proteomics can rarely assign a peptide species to a singular protein. In MaxQuant these are called proteinGroups. Because we have do not require protein-specific results, and gene identity is more stable, our gene count describes the groups to which our detected proteins have been be assigned. Thus gene here is being detected by protein product, just as it would be detected by RNA product in RNA Seq; none of these 3 are synonymous. To be clear, this is a count and not a measure.

Gene frequency is defined by the summed observations per protein regardless of intensity value and this data is extracted to our modified metadata with jinspect.MQ_getFrequencyBySample( ) .
A typical MQ search will yield identical protein counts (though different values) for Intensity and iBAQ*. LFQ frequencies will vary depending on the search settings:

  • In this case the MQ search has set LFQ values to be calculated on a min 2 peptide ratio (this is the default)**

Notes
* Why protein counts should be identical I don't know. The original iBAQ paper stipulates rules for the inclusion of a protein in the iBAQ calculation but MaxQuant doesn't seem to apply them.
** Previously I tested LFQ min ratio at 1 peptide. At 1 minimum peptide there was unexpected QC clustering. Possible explanations for this are explained in section 7 and are cleaned up by jinspect.MQ_dropDuplicateIDs( ) function. We can expect this function to greatly reduce qualifying IDs (~20% fewer), especially in the QE samples, but I think the trade-off is worth it because we gain 1) a more robust ID check and 2) the same search can be used for LFQ based checks of dynamic changes, i.e. comparing more than one group of cCL captures for biological changes.

In [24]:
#### Count the number of unique 
metaStats = jinspect.MQ_getFrequencyBySample(pGroup_log2, metaStats, freqList = ['Intensity', 'iBAQ', 'LFQ intensity'], measure = False)
metaStats
Out[24]:
experiment condition replicate sample measure MQgroups Peptides Intensity iBAQ LFQ intensity
0 17_pH8Silane20heat_NC-A pH7-Silane-20-heat_nCL A pH7-Silane-20-heat_nCL-A Intensity 20silaneheat 176.0 30 30 30
1 18_pH8Silane20heat_NC-B pH7-Silane-20-heat_nCL B pH7-Silane-20-heat_nCL-B Intensity 20silaneheat 152.0 31 31 31
2 19_pH8Silane20heat_NC-C pH7-Silane-20-heat_nCL C pH7-Silane-20-heat_nCL-C Intensity 20silaneheat 151.0 49 49 49
3 20_pH8Silane20heat_NC-D pH7-Silane-20-heat_nCL D pH7-Silane-20-heat_nCL-D Intensity 20silaneheat 123.0 50 50 50
4 21_pH8Silane20heat_PC-A pH7-Silane-20-heat_254 A pH7-Silane-20-heat_254-A Intensity 20silaneheat 13781.0 1254 1254 1254
5 22_pH8Silane20heat_PC-B pH7-Silane-20-heat_254 B pH7-Silane-20-heat_254-B Intensity 20silaneheat 13719.0 1252 1252 1252
6 23_pH8Silane20heat_PC-C pH7-Silane-20-heat_254 C pH7-Silane-20-heat_254-C Intensity 20silaneheat 13836.0 1284 1284 1284
7 24_pH8Silane20heat_PC-D pH7-Silane-20-heat_254 D pH7-Silane-20-heat_254-D Intensity 20silaneheat 13909.0 1527 1527 1527
8 33_pH8Silane20obd_NC-A pH7-Silane-20-obd_nCL A pH7-Silane-20-obd_nCL-A Intensity 20silaneobd 9868.0 697 697 697
9 34_pH8Silane20obd_NC-B pH7-Silane-20-obd_nCL B pH7-Silane-20-obd_nCL-B Intensity 20silaneobd 11367.0 750 750 750
10 35_pH8Silane20obd_NC-C pH7-Silane-20-obd_nCL C pH7-Silane-20-obd_nCL-C Intensity 20silaneobd 11211.0 757 757 757
11 36_pH8Silane20obd_NC-D pH7-Silane-20-obd_nCL D pH7-Silane-20-obd_nCL-D Intensity 20silaneobd 11843.0 960 960 960
12 37_pH8Silane20obd_PC-A pH7-Silane-20-obd_254 A pH7-Silane-20-obd_254-A Intensity 20silaneobd 15215.0 1177 1177 1177
13 38_pH8Silane20obd_PC-B pH7-Silane-20-obd_254 B pH7-Silane-20-obd_254-B Intensity 20silaneobd 15251.0 1192 1192 1192
14 39_pH8Silane20obd_PC-C pH7-Silane-20-obd_254 C pH7-Silane-20-obd_254-C Intensity 20silaneobd 15120.0 1208 1208 1208
15 40_pH8Silane20obd_PC-D pH7-Silane-20-obd_254 D pH7-Silane-20-obd_254-D Intensity 20silaneobd 15301.0 1470 1470 1470
16 25_pH8Silane30heat_NC-A pH7-Silane-30-heat_nCL A pH7-Silane-30-heat_nCL-A Intensity 30silaneheat 259.0 53 53 53
17 26_pH8Silane30heat_NC-B pH7-Silane-30-heat_nCL B pH7-Silane-30-heat_nCL-B Intensity 30silaneheat 263.0 58 58 58
18 27_pH8Silane30heat_NC-C pH7-Silane-30-heat_nCL C pH7-Silane-30-heat_nCL-C Intensity 30silaneheat 274.0 64 64 64
19 28_pH8Silane30heat_NC-D pH7-Silane-30-heat_nCL D pH7-Silane-30-heat_nCL-D Intensity 30silaneheat 262.0 88 88 88
20 29_pH8Silane30heat_PC-A pH7-Silane-30-heat_254 A pH7-Silane-30-heat_254-A Intensity 30silaneheat 14094.0 1302 1302 1302
21 30_pH8Silane30heat_PC-B pH7-Silane-30-heat_254 B pH7-Silane-30-heat_254-B Intensity 30silaneheat 13650.0 1258 1258 1258
22 31_pH8Silane30heat_PC-C pH7-Silane-30-heat_254 C pH7-Silane-30-heat_254-C Intensity 30silaneheat 13611.0 1285 1285 1285
23 32_pH8Silane30heat_PC-D pH7-Silane-30-heat_254 D pH7-Silane-30-heat_254-D Intensity 30silaneheat 13778.0 1532 1532 1532
24 41_pH8Silane30obd_NC-A pH7-Silane-30-obd_nCL A pH7-Silane-30-obd_nCL-A Intensity 30silaneobd 13320.0 833 833 833
25 42_pH8Silane30obd_NC-B pH7-Silane-30-obd_nCL B pH7-Silane-30-obd_nCL-B Intensity 30silaneobd 13634.0 857 857 857
26 43_pH8Silane30obd_NC-C pH7-Silane-30-obd_nCL C pH7-Silane-30-obd_nCL-C Intensity 30silaneobd 13214.0 846 846 846
27 44_pH8Silane30obd_NC-D pH7-Silane-30-obd_nCL D pH7-Silane-30-obd_nCL-D Intensity 30silaneobd 13933.0 1072 1072 1072
28 46_pH8Silane30obd_PC-B pH7-Silane-30-obd_254 B pH7-Silane-30-obd_254-B Intensity 30silaneobd 11263.0 994 994 994
29 47_pH8Silane30obd_PC-C pH7-Silane-30-obd_254 C pH7-Silane-30-obd_254-C Intensity 30silaneobd 16089.0 1219 1219 1219
30 48_pH8Silane30obd_PC-D pH7-Silane-30-obd_254 D pH7-Silane-30-obd_254-D Intensity 30silaneobd 16276.0 1521 1521 1521
31 46_pH8Silane30obd_PC-F pH7-Silane-30-obd_254 F pH7-Silane-30-obd_254-F Intensity 30silaneobd 16247.0 1219 1219 1219
In [25]:
#### Plot the counts
sns.set_style('whitegrid')
jvis.BarPlotByGroup_sbplot(metaStats, x_col = 'condition', y_col = 'Intensity', title = '# Genes Detected By Group', pal = set2_paired, ylabel = 'Unique Genes', errorbars = 'SEM')
<Figure size 432x288 with 0 Axes>

Results: GOOD

The pattern of counts is similar for genes as it is for peptides.
I wonder if the identities of all cCL counts are similar. If so, does this suggest that the RNA anchor is responsible for dragging a crosslinked complex back into the heated solution, while free protein remains behind on the bead?
Were this the case, it would also suggest that the silane wash steps are important for parting RNA from non-crosslinked protein- thus the free protein is left behind on the bead to be found by the on-bead digest.

13. Assess Replicate Correlation

Functions
jwrangle.MQ_getSliceByPrefix( )
jvis.showPearsonRegression_altair( )

The function _jwrangle.MQgetSliceByPrefix( ) provides a convenient means of extracting values of a specific group.
We can then use _jvis.showPearsonRegressionaltair( ) to perform pairwise comparisons between each member of those groups. This function is specifically applied to genes with shared intensities- genes exclusive to one sample or the other, represented by vertical or horizontal datapoints, are plotted but excluded from the pearson calculation.

In [26]:
#### Extract the intensity values as a dictionary where keys = groups
Intensity_Dict = jwrangle.MQ_getSliceByPrefix(pGroup_log2, metadata, 'Intensity', group = 'condition', add_col = None)
Intensity_Dict.keys()
Out[26]:
dict_keys(['pH7-Silane-20-heat_nCL', 'pH7-Silane-20-heat_254', 'pH7-Silane-30-heat_nCL', 'pH7-Silane-30-heat_254', 'pH7-Silane-20-obd_nCL', 'pH7-Silane-20-obd_254', 'pH7-Silane-30-obd_nCL', 'pH7-Silane-30-obd_254'])
In [27]:
#### Check replicate consistency across all within group pairs
jvis.showPearsonRegression_altair(Intensity_Dict['pH7-Silane-20-heat_nCL'], mark_color = set2_paired[2])
jvis.showPearsonRegression_altair(Intensity_Dict['pH7-Silane-20-heat_254'], mark_color = set2_paired[2])
jvis.showPearsonRegression_altair(Intensity_Dict['pH7-Silane-20-obd_nCL'], mark_color = set2_paired[4])
jvis.showPearsonRegression_altair(Intensity_Dict['pH7-Silane-20-obd_254'], mark_color = set2_paired[4])
jvis.showPearsonRegression_altair(Intensity_Dict['pH7-Silane-30-heat_nCL'], mark_color = set2_paired[6])
jvis.showPearsonRegression_altair(Intensity_Dict['pH7-Silane-30-heat_254'], mark_color = set2_paired[6])
jvis.showPearsonRegression_altair(Intensity_Dict['pH7-Silane-30-obd_nCL'], mark_color = set2_paired[8])
jvis.showPearsonRegression_altair(Intensity_Dict['pH7-Silane-30-obd_254'], mark_color = set2_paired[8])

Results: GOOD

Replicates look good among 254nm cCL samples. More varied among the sparse nCL samples which should be expected.

14. Classify RBP

Functions
jinspect.MQ_getFrequencyByGroup()
jtest.MQ_applyClassifyRBP()

Before classifying our RBP we need to first tally the frequency with which each protein appears in each condition using _jinspect.MQgetFrequencyByGroup( )
Once done, we use _jtest.MQapplyClassifyRBP( ) to generate a dictionary from which each class can be reviewed or plotted.

In [28]:
#### Use the metadata and proteinGroups tables to count how many times a gene is identified in its group (/6). 
#### Here I demonstrate how we can count for all instances of Intensity, iBAQ and LFQ Intensity
pGroup_Freq = jinspect.MQ_getFrequencyByGroup(pGroup_log2, metadata, 'iBAQ', group = 'condition')
pGroup_Freq = jinspect.MQ_getFrequencyByGroup(pGroup_Freq, metadata, 'LFQ intensity', group = 'condition')
pGroup_Freq = jinspect.MQ_getFrequencyByGroup(pGroup_Freq, metadata, 'Intensity', group = 'condition')
pGroup_Freq[['ENTREZGENE_gPro primary'] + [i for i in pGroup_Freq.columns if 'Freq' in i]].head(2)
Out[28]:
ENTREZGENE_gPro primary iBAQ Freq: pH7-Silane-20-heat_nCL iBAQ Freq: pH7-Silane-20-heat_254 iBAQ Freq: pH7-Silane-30-heat_nCL iBAQ Freq: pH7-Silane-30-heat_254 iBAQ Freq: pH7-Silane-20-obd_nCL iBAQ Freq: pH7-Silane-20-obd_254 iBAQ Freq: pH7-Silane-30-obd_nCL iBAQ Freq: pH7-Silane-30-obd_254 LFQ intensity Freq: pH7-Silane-20-heat_nCL ... LFQ intensity Freq: pH7-Silane-30-obd_nCL LFQ intensity Freq: pH7-Silane-30-obd_254 Intensity Freq: pH7-Silane-20-heat_nCL Intensity Freq: pH7-Silane-20-heat_254 Intensity Freq: pH7-Silane-30-heat_nCL Intensity Freq: pH7-Silane-30-heat_254 Intensity Freq: pH7-Silane-20-obd_nCL Intensity Freq: pH7-Silane-20-obd_254 Intensity Freq: pH7-Silane-30-obd_nCL Intensity Freq: pH7-Silane-30-obd_254
0 HDLBP 0 4 0 4 0 4 0 4 0 ... 0 4 0 4 0 4 0 4 0 4
1 RPS9 0 4 0 4 4 4 1 4 0 ... 1 4 0 4 0 4 4 4 1 4

2 rows × 25 columns

In [29]:
#### Now we can classify RBP; the console will report if all proteins/genes have been properly classified or not.
RBP_Dict_20heat = jtest.MQ_applyClassifyRBP(pGroup_Freq, 'LFQ intensity', metadata, '20silaneheat', 'pH7-Silane-20-heat_nCL', 'pH7-Silane-20-heat_254', 3, 
                                         add_cols = ['ENTREZGENE_gPro all', 'ENTREZGENE_gPro primary', 'ENTREZGENE_gPro name'])
All proteins classified = True
In [30]:
#### Now we can classify RBP; the console will report if all proteins/genes have been properly classified or not.
RBP_Dict_20obd = jtest.MQ_applyClassifyRBP(pGroup_Freq, 'LFQ intensity', metadata, '20silaneobd', 'pH7-Silane-20-obd_nCL', 'pH7-Silane-20-obd_254', 3, 
                                         add_cols = ['ENTREZGENE_gPro all', 'ENTREZGENE_gPro primary', 'ENTREZGENE_gPro name'])
All proteins classified = True
In [31]:
#### Now we can classify RBP; the console will report if all proteins/genes have been properly classified or not.
RBP_Dict_30heat = jtest.MQ_applyClassifyRBP(pGroup_Freq, 'LFQ intensity', metadata, '30silaneheat', 'pH7-Silane-30-heat_nCL', 'pH7-Silane-30-heat_254', 3, 
                                         add_cols = ['ENTREZGENE_gPro all', 'ENTREZGENE_gPro primary', 'ENTREZGENE_gPro name'])
All proteins classified = True
In [32]:
#### Now we can classify RBP; the console will report if all proteins/genes have been properly classified or not.
RBP_Dict_30obd = jtest.MQ_applyClassifyRBP(pGroup_Freq, 'LFQ intensity', metadata, '30silaneobd', 'pH7-Silane-30-obd_nCL', 'pH7-Silane-30-obd_254', 3, 
                                         add_cols = ['ENTREZGENE_gPro all', 'ENTREZGENE_gPro primary', 'ENTREZGENE_gPro name'])
All proteins classified = True
In [33]:
#### The results of this classifcation can be found in a dictionary of dataframes for each output
#### See help() for an explanation of each dataset
RBP_Dict_20heat.keys()
Out[33]:
dict_keys(['Input_df_annStatus', 'Summary_df_annStatus', 'I', 'IIa', 'IIb', 'IIc', 'NC', 'ND'])
In [34]:
#### We want the most general overview which of our classes per treatment and so concatenate the results from each Summary_df_annStatus
RBP_Class = pd.concat([RBP_Dict_20heat['Summary_df_annStatus'], RBP_Dict_20obd['Summary_df_annStatus'], RBP_Dict_30heat['Summary_df_annStatus'], RBP_Dict_30obd['Summary_df_annStatus']], axis=0, join='outer')

#### And represent them in a barplot
from matplotlib import pyplot as plt

plt.figure('rbp class')
sns.set_style('whitegrid')
sns.set(font_scale=1.1)
ax = sns.countplot(x='MQgroup', hue = 'RBP Class', data=RBP_Class.sort_values(by = ['MQgroup', 'RBP Class']), palette = cpal['RBP_Class'], edgecolor = 'black')
ax.set_ylabel('Unique Gene Count')
ax.set_title('Unique Genes detected per RBP Class')

RBP_Class.head(1)
Out[34]:
ENTREZGENE_gPro all ENTREZGENE_gPro primary ENTREZGENE_gPro name Gene names RBP Class RBP subClass MQgroup
0 HDLBP;None HDLBP high density lipoprotein binding protein [Sour... HDLBP I 20silaneheat
In [35]:
#### Because each subclass of the class II RBP are identified based on different statistical assumptions we can more closely inspect if those assumptions trend differently among the different pH treatments.
plt.figure('rbp class')
sns.set_style('whitegrid')
sns.set(font_scale=1.1)
ax = sns.countplot(x='MQgroup', hue = 'RBP subClass', data=RBP_Class[RBP_Class['RBP subClass']!=''].sort_values(by=['MQgroup','RBP subClass']), palette = cpal['Set3_qual'], edgecolor = 'black')
ax.set_ylabel('Unique Gene Count')
ax.set_title('Unique Genes detected per RBP Class')

RBP_Class.head(1)
Out[35]:
ENTREZGENE_gPro all ENTREZGENE_gPro primary ENTREZGENE_gPro name Gene names RBP Class RBP subClass MQgroup
0 HDLBP;None HDLBP high density lipoprotein binding protein [Sour... HDLBP I 20silaneheat

Results: GOOD

Consistent with the nCL vs cCL count disparity among the heated eluates it would appear that Class I identified RBPs constitute the majority of IDs.
This relationshiup is far less convincing for the on-bead digest samples with the Class II being the majority, many of which depended on fold change calculations for identification.

15. Compare RBP Identity Between Ethanol Treatments

Here we use the dictionaries output by jtest.MQ_applyClassifyRBP( ) for the purposes of creating a Venn Diagram.

The heated eluate samples were the clear winners, though it's worth investigating whether there is a difference in the genes identified between the 20% and 30% ethanol salt-bridging conditions.
Current incongruities, i.e. the use of salt-free washes prior to elution and also later experimental evidence (expt.311) revealed that salt bridging is unlikely to be the silane capture mechanism;
rather that selective precipitation is the true mechanism of selection. Indeed, the selection of 20% and 30% ethanol used in this protocol was based on early investigation into selective
precipitation (Expt.229). Either way, let's compare the overlap between Class I RBPs identified in 20% vs 30% EtOH.

In [36]:
from matplotlib import pyplot as plt
import numpy as np
from matplotlib_venn import venn3, venn3_circles

plt.figure(figsize=(6,6))

v = venn3([set(RBP_Dict_20obd['NC']['ENTREZGENE_gPro primary']), 
           set(RBP_Dict_20heat['I']['ENTREZGENE_gPro primary']),
           set(RBP_Dict_30heat['I']['ENTREZGENE_gPro primary'])], 
           set_labels = ('On Bead Digest\nNot Classified', '20% Ethanol Capture\nHeated Elution', '30% Ethanol Capture\nHeated Elution'),
           alpha = 0.5)

c = venn3_circles([set(RBP_Dict_20obd['NC']['ENTREZGENE_gPro primary']), 
                   set(RBP_Dict_20heat['I']['ENTREZGENE_gPro primary']), 
                   set(RBP_Dict_30heat['I']['ENTREZGENE_gPro primary'])], linestyle='dashed')

c[0].set_lw(1.5)
c[0].set_ls('dotted')
c[1].set_lw(1.5)
c[1].set_ls('dotted')
c[2].set_lw(1.5)
c[2].set_ls('dotted')

for text in v.set_labels:
    text.set_fontsize(18)
for text in v.subset_labels:
    text.set_fontsize(14)

Results: GOOD

The overlap here is nearly complete. It is likely that 20% and 30% ethanol are largely equivalent.
The comparative lack of overlap with the group of non-classified genes from the on-bead digestion condition emphasizes that this similarity is unlikely to be random.

16. Review Basic Gene Ontology

Functions
jwrangle.importMixedFiles( )
jweb.fetchQuickGO_stats( )

In this section we will explore Gene Ontology (GO) memberships for the observed proteins. There is little use in applying statistical tests such as Gene Ontology Enrichment Analysis (GOEA) for these experiments; the combination of selection by RNA interaction and the comparative lack of deep RBP validation for many candidates would make such a study rather spurious. We can, however, investigate the frequency with which our identified RBPs appear in previous studies. In addition, we can use this frequency to further assess whether the 20% and 30% capture conditions used here are equivalent or not.

A number of GO-specific and utility functions are provided to help with retrieving Gene Ontologies from the QuickGo database. In this section we'll look at the most basic.
The function jweb.fetchQuickGO_stats( ) will fetch the annotation statistics for all records belonging to the gene ID from a submitted list. Reviewing these statistics before beginning an analysis is ideal, because it contextualises the breadth of future analyses. This function returns these statistics in the from of a dictionary which can be converted to a dataframe for easier viewing by using the dedicated function jweb.getQuickGO_stats( ).

In addition to contextualising the search space of subsequent analyses, fetching the annotation numbers is important for checking that the number of records, per GO ID, falls below 10000. This is because QuickGo will not allow larger searches to be done programmatically. If your GO ID of interest has many more records users should retrieve their records manually. These details will be covered fourther in the next section.

In [37]:
#### I like to keep a table of interesting GO terms in a local csv file, let's find it
MyResources = jwrangle.importMixedFiles(base_path / 'my_resources')
MyResources.keys()
Out[37]:
dict_keys(['.ipynb_checkpoints', 'control_elements', 'control_proteins', 'custom_dfs', 'FASTA', 'GOexplore.ipynb', 'GO_TermsOfInterest.csv', 'GO_TermsOfInterest_stats.csv', 'Hentze_core_RBP_list.txt', 'Hentze_core_RBP_list_uniprot.txt', 'Hentze_core_RBP_plusE1_8.txt', 'instrumentQC', 'limma_normalizeTIs_p3550_p3592_p3657.nb.html', 'limma_normalizeTIs_p3550_p3592_p3657.Rmd', 'TIs_p3550_p3592_p3657.csv', 'TIs_p3550_p3592_p3657_loess.csv'])
In [38]:
MyResources['GO_TermsOfInterest.csv'].head(9)
Out[38]:
GO Term relationship depth go ID Description ctrl list type parent(s)
0 binding Molecular Function 1 GO:0005488 NaN NaN GO:0003674
1 protein binding Molecular Function 2 GO:0005515 NaN NaN GO:0005488
2 chromatin binding Molecular Function 2 GO:0003682 NaN NaN GO:0005488
3 nucleic acid binding Molecular Function 3 GO:0003676 NaN NaN GO:0003676
4 DNA binding Molecular Function 4 GO:0003677 NaN general GO:0003676
5 RNA binding Molecular Function 4 GO:0003723 NaN general GO:0003676
6 DNA/RNA hybrid binding Molecular Function 4 GO:0071667 NaN general GO:0003676
7 translation regulator activity Molecular Function 4 GO:0090079 NaN NaN GO:0003676
8 regulatory region nucleic acid binding Molecular Function 4 GO:0001067 NaN NaN GO:0003676
In [39]:
#### Let's get basic statistics on the broader RNA-binding category "GO:0003723"  
MyGO_Stats_Dict = jweb.fetchQuickGO_stats(['GO:0003723'], QG_geneProductType = 'protein', QG_taxonId = '9606', QG_geneProductSubset = ['Swiss-Prot', 'TrEMBL'])
MyGO_Stats_Dict.keys()
Out[39]:
dict_keys(['GO0003723_Swiss-Prot', 'GO0003723_TrEMBL'])
In [40]:
#### To view basic information, like the number of annotations associated with the GO term
RNA_Binding_stats = jweb.getQuickGO_stats(MyGO_Stats_Dict)
RNA_Binding_stats
Out[40]:
go ID Swiss-Prot_annotations Swiss-Prot_genesProducts TrEMBL_annotations TrEMBL_genesProducts
0 GO:0003723 5245 1685 4976 2620

Result: GOOD

The SwissProt and TrEMBL counts are below 10000 records so, conveniently, we can continue the tutorial with the simplest scenario for record extraction.

17. Retrieve GO Records

Functions
jweb.fetchQuickGO( )
jwrangle.concatGO_DataFrameDict( )
jweb.mapQuickGO( )
jwrangle.AnnotateDataFrameCtrls( )
jinspect.MQ_getFrequencyBySample( )

We can see above that none of our downloaded IDs have exceeded our fetch limit of 10000 records. If any did, we would need to manually retrieve the tsv file.
For the GO terms we wish to investigate to investigate jweb.fetchQuickGO( ) will fetch the relevant proteins and also apply the same onsistent gene ID mapping algorithm we used earlier when remapping the MaxQuant IDs. This process is important because it allows us to provide consistent gene IDs to subsequent sets analysis. Where the writetopath options is used, the results from any searches will be saved to a local downloads folder- this is useful as a snapshot of the annotations used at the time or as a simple way of avoiding fetching the records via API in the future.

The function jwrangle.concatGO_DataFrameDict( ) let's us concatenate a dictionary created by jweb.fetchQuickGO( ). This is useful when the user wishes to pool both SwissProt and TrEMBL records for a given

The function jwrangle.AnnotateDataFrameCtrls( ) taks our QuickGo records and annotates each protein or gene in a given dataframe (i.e. a proteinGroups table) for membership to the searched GO ID. The function jinspect.MQ_getFrequencyBySample( ) acts similarly to the peptide and gene count functions used earlier- it will sum the frequency of memberships to each GO catgeory and report the counts as a modified metadata table.

A Note on Manual Downloads
The jweb.fetchQuickGO( ) will also output to console the html address that one should use if a manual download if required; this address will encompass all the relevant record characteristics that typically used by this suite. Once this table has been downloaded, the function jweb.mapQuickGO( ) can be used to provide the same ID mapping service as performed in jweb.fetchQuickGO( ). The returned elements will also be manipulated into the same format.

In [41]:
#### Download and create a local copy of all the records associated with our GO query.  
QuickGo_dict = jweb.fetchQuickGO(['GO:0003723'], QG_geneProductType = 'protein', QG_taxonId = '9606', QG_geneProductSubset = ['Swiss-Prot', 'TrEMBL'], gConvertOrganism='hsapiens', writetopath= True)
QuickGo_dict.keys()
GO0003723_Swiss-Prot has 5245 records: 
https://www.ebi.ac.uk/QuickGO/annotations?goId=GO:0003723&taxonId=9606&taxonUsage=exact&goUsage=descendants&geneProductType=protein&geneProductSubset=Swiss-Prot

GO0003723_TrEMBL has 4976 records: 
https://www.ebi.ac.uk/QuickGO/annotations?goId=GO:0003723&taxonId=9606&taxonUsage=exact&goUsage=descendants&geneProductType=protein&geneProductSubset=TrEMBL

Out[41]:
dict_keys(['GO0003723_Swiss-Prot', 'GO0003723_TrEMBL'])
In [42]:
### We'll assess our RBP frequency by considering both SwissProt and TrEmbl records, thus let's concatenate the dictionaries for GO:0003723
QuickGo_dict_concat = jwrangle.concatGO_DataFrameDict(QuickGo_dict)
QuickGo_dict_concat.keys()
Out[42]:
dict_keys(['GO0003723'])
In [43]:
#### With our GO records in hand we next investigate whether our identified proteins are members of GO:OOO3723
#### This can be done with any DataFrame, but here we'll use the last version of our modified proteinGroups table 
pGroup_GO = jwrangle.AnnotateDataFrameCtrls(pGroup_Freq, QuickGo_dict_concat, search_match = 'ENTREZGENE_gPro primary', dict_match = 'ENTREZGENE_gPro primary', none_col = 'GO_None')
pGroup_GO.keys()
D:\MEGA\Programming\Scripts_JS\RBP_SUITE\modules\jwrangle.py:43: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  ctrl_nohits_df[none_col] = True                                                                           #add zero annotation column to annotation columns
Out[43]:
dict_keys(['ann_df', 'ann_subdf'])
In [44]:
#### The ann_df contains the master dataframe. The sub_df contains a slice for each positively identified Gene in each GO category searched.
#### Let's view what those membership annotations look like
pGroup_GO['ann_df'][['ENTREZGENE_gPro primary']+[i for i in list(pGroup_GO['ann_df'].columns) if 'GO' in i]].head(2)
Out[44]:
ENTREZGENE_gPro primary GO0003723 GO_None
0 HDLBP True False
1 RPS9 True False
In [45]:
#### Now to find out the counts for each GO ID being searched we will modify our metadata table in a fashion in the same way as for prptide and gene counting.
#### We'll use iBAQ for these counts but, given all intensities have previously been filtered by LFQ membership both 'Intensity' or 'LFQ intensity' would give the same result.
metaStatsGo = jinspect.MQ_getFrequencyBySample(pGroup_GO['ann_df'], metaStats, freqList = list(QuickGo_dict_concat.keys()) + ['GO_None'], measure = 'iBAQ')
metaStatsGo
Out[45]:
experiment condition replicate sample measure MQgroups Peptides Intensity iBAQ LFQ intensity GO0003723 GO_None
0 17_pH8Silane20heat_NC-A pH7-Silane-20-heat_nCL A pH7-Silane-20-heat_nCL-A Intensity 20silaneheat 176.0 30 30 30 22 8
1 18_pH8Silane20heat_NC-B pH7-Silane-20-heat_nCL B pH7-Silane-20-heat_nCL-B Intensity 20silaneheat 152.0 31 31 31 23 8
2 19_pH8Silane20heat_NC-C pH7-Silane-20-heat_nCL C pH7-Silane-20-heat_nCL-C Intensity 20silaneheat 151.0 49 49 49 35 14
3 20_pH8Silane20heat_NC-D pH7-Silane-20-heat_nCL D pH7-Silane-20-heat_nCL-D Intensity 20silaneheat 123.0 50 50 50 31 19
4 21_pH8Silane20heat_PC-A pH7-Silane-20-heat_254 A pH7-Silane-20-heat_254-A Intensity 20silaneheat 13781.0 1254 1254 1254 885 369
5 22_pH8Silane20heat_PC-B pH7-Silane-20-heat_254 B pH7-Silane-20-heat_254-B Intensity 20silaneheat 13719.0 1252 1252 1252 892 360
6 23_pH8Silane20heat_PC-C pH7-Silane-20-heat_254 C pH7-Silane-20-heat_254-C Intensity 20silaneheat 13836.0 1284 1284 1284 898 386
7 24_pH8Silane20heat_PC-D pH7-Silane-20-heat_254 D pH7-Silane-20-heat_254-D Intensity 20silaneheat 13909.0 1527 1527 1527 971 556
8 33_pH8Silane20obd_NC-A pH7-Silane-20-obd_nCL A pH7-Silane-20-obd_nCL-A Intensity 20silaneobd 9868.0 697 697 697 312 385
9 34_pH8Silane20obd_NC-B pH7-Silane-20-obd_nCL B pH7-Silane-20-obd_nCL-B Intensity 20silaneobd 11367.0 750 750 750 318 432
10 35_pH8Silane20obd_NC-C pH7-Silane-20-obd_nCL C pH7-Silane-20-obd_nCL-C Intensity 20silaneobd 11211.0 757 757 757 337 420
11 36_pH8Silane20obd_NC-D pH7-Silane-20-obd_nCL D pH7-Silane-20-obd_nCL-D Intensity 20silaneobd 11843.0 960 960 960 381 579
12 37_pH8Silane20obd_PC-A pH7-Silane-20-obd_254 A pH7-Silane-20-obd_254-A Intensity 20silaneobd 15215.0 1177 1177 1177 760 417
13 38_pH8Silane20obd_PC-B pH7-Silane-20-obd_254 B pH7-Silane-20-obd_254-B Intensity 20silaneobd 15251.0 1192 1192 1192 770 422
14 39_pH8Silane20obd_PC-C pH7-Silane-20-obd_254 C pH7-Silane-20-obd_254-C Intensity 20silaneobd 15120.0 1208 1208 1208 776 432
15 40_pH8Silane20obd_PC-D pH7-Silane-20-obd_254 D pH7-Silane-20-obd_254-D Intensity 20silaneobd 15301.0 1470 1470 1470 858 612
16 25_pH8Silane30heat_NC-A pH7-Silane-30-heat_nCL A pH7-Silane-30-heat_nCL-A Intensity 30silaneheat 259.0 53 53 53 38 15
17 26_pH8Silane30heat_NC-B pH7-Silane-30-heat_nCL B pH7-Silane-30-heat_nCL-B Intensity 30silaneheat 263.0 58 58 58 43 15
18 27_pH8Silane30heat_NC-C pH7-Silane-30-heat_nCL C pH7-Silane-30-heat_nCL-C Intensity 30silaneheat 274.0 64 64 64 48 16
19 28_pH8Silane30heat_NC-D pH7-Silane-30-heat_nCL D pH7-Silane-30-heat_nCL-D Intensity 30silaneheat 262.0 88 88 88 56 32
20 29_pH8Silane30heat_PC-A pH7-Silane-30-heat_254 A pH7-Silane-30-heat_254-A Intensity 30silaneheat 14094.0 1302 1302 1302 904 398
21 30_pH8Silane30heat_PC-B pH7-Silane-30-heat_254 B pH7-Silane-30-heat_254-B Intensity 30silaneheat 13650.0 1258 1258 1258 889 369
22 31_pH8Silane30heat_PC-C pH7-Silane-30-heat_254 C pH7-Silane-30-heat_254-C Intensity 30silaneheat 13611.0 1285 1285 1285 904 381
23 32_pH8Silane30heat_PC-D pH7-Silane-30-heat_254 D pH7-Silane-30-heat_254-D Intensity 30silaneheat 13778.0 1532 1532 1532 977 555
24 41_pH8Silane30obd_NC-A pH7-Silane-30-obd_nCL A pH7-Silane-30-obd_nCL-A Intensity 30silaneobd 13320.0 833 833 833 341 492
25 42_pH8Silane30obd_NC-B pH7-Silane-30-obd_nCL B pH7-Silane-30-obd_nCL-B Intensity 30silaneobd 13634.0 857 857 857 347 510
26 43_pH8Silane30obd_NC-C pH7-Silane-30-obd_nCL C pH7-Silane-30-obd_nCL-C Intensity 30silaneobd 13214.0 846 846 846 348 498
27 44_pH8Silane30obd_NC-D pH7-Silane-30-obd_nCL D pH7-Silane-30-obd_nCL-D Intensity 30silaneobd 13933.0 1072 1072 1072 413 659
28 46_pH8Silane30obd_PC-B pH7-Silane-30-obd_254 B pH7-Silane-30-obd_254-B Intensity 30silaneobd 11263.0 994 994 994 672 322
29 47_pH8Silane30obd_PC-C pH7-Silane-30-obd_254 C pH7-Silane-30-obd_254-C Intensity 30silaneobd 16089.0 1219 1219 1219 791 428
30 48_pH8Silane30obd_PC-D pH7-Silane-30-obd_254 D pH7-Silane-30-obd_254-D Intensity 30silaneobd 16276.0 1521 1521 1521 903 618
31 46_pH8Silane30obd_PC-F pH7-Silane-30-obd_254 F pH7-Silane-30-obd_254-F Intensity 30silaneobd 16247.0 1219 1219 1219 787 432
In [46]:
#### To plot these counts
jvis.BarPlotByGroup_sbplot(metaStatsGo, x_col = 'condition', y_col = 'GO0003723', title = 'GO:0003723, RNA-Binding', pal = set2_paired)
<Figure size 432x288 with 0 Axes>

Results: GOOD

That's a lot of RNA-binding proteins. But, for further validation, it will be far more useful to calculate the relative memberships, i.e. GO:0003723 vs None.

In [47]:
#### Calculate the % GO0003723 members per group
metaStatsGo['% RBP'] = metaStatsGo.apply(lambda row: round(100*row['GO0003723']/(row['GO0003723']+row['GO_None'])), axis = 1)

#### Plot
jvis.BarPlotByGroup_sbplot(metaStatsGo, x_col = 'condition', y_col = '% RBP', title = 'GO:0003723, RNA-Binding', pal = set2_paired, yrange = [0,100])
<Figure size 432x288 with 0 Axes>

Results: GOOD

Well the % RNA binding also looks quite similar between both 20% vs 30% EtOH, and nCL vs 254, heated eluates. This would also be consistent with the notion:

  • the 20% or 30% EtOH formulas are near equivalent.
  • only proteins that is RNA tethered will be eluted; the proteins that came off in the nCL group are likely stubborn interactors.

The difference between nCL and cCL obd samples could be explained by the nCL and cCL interphases being of different composition prior to bead capture. Thus, as previous experiments have shown, there is some capacity to discriminate at the interphase though the challenge is in further purifying the RNPs.

Conclusion

Simply put, the present protocol works well for global capture but precise elution by gentle heating is key to selecting for proteins that are RNA-bound.

The protocol most probably functions on the basis of selective precipitation rather than by the commonly believed salt-bridging method.

  • This does not disqualify salt bridging from having a role in convetional RNA extraction- elimination of salts in these protocls releases free RNA.
  • This might indicates that the selective precipitative method here simultaneously removes any free RNA during the salt free wash- this would mean the RNPs are wholly purified. They are free of both untethered RNA and untethered protein.
  • This claim, however, needs to be proven. To do so most simply:
    • Retain the first salt free wash. Quant the RNA, and SYPRO stain for protein.
    • From the heated eluate, Quant the RNA and SYPRO for protein.
    • Calculate the ratio of protein:RNA and compare between wash and eluate.